From Marbles to Daxes

An Introduction to HBMs and Their Application to Category Learning

Jan Luca Schnatz

Hierarchical Beta-Binomial Model

Motivating Example

Repeatedly draw from bags of black and white marbles with unknown proportion of black marbles:

  1. What color would you predict for the next marble drawn from bag 8?
  2. How did you arrive at that prediction?

Intuition

  • One black marble alone gives little information about the color of future marbles
  • But having previously seen many mostly-black or mostly-white bags makes that single black marble highly informative

\(\rightarrow\) High chance of next marbles also being black!

Intuition

  • Hierarchy
    • Information is shared across bags at higher levels
    • Observations from previous bags shape strong priors
    • These priors influence predictions about new bags

Goal

We want to build a Bayesian model that reverse-engineers the mind's reasoning about color distributions across bags.

Formalizing the Problem

We have \(N\) bags of marbles, where \(y_i\) is the number of black marbles observed in bag \(i\) and \(n_i\) is the total number of marbles drawn from it.

Level 1 – Data

\(d_i: \big\{y_i, n_i \big\}\)

Level 2 – Bag-specific distribution

  • \(\theta_i\): Probability of drawing a black marble from bag \(i\)
  • Different bags can have different probabilities \(\theta_i\)

\(y_i ~ \big| ~ n_i \sim \text{Binom}(n_i, \theta_i)\)

Level 3 – General knowledge about bags

Reparameterization as

  • Expected value \(\frac{\alpha}{\alpha + \beta}\) of \(\theta_i\)
  • Precision \(\alpha + \beta\), capturing the concentration of probability mass around the mean (higher precision means lower variance)

\(\theta_i \sim \text{Beta}(\alpha, \beta)\)

Level 4 – Hyperparameters

  • Priors for the parameters of the Beta distribution
  • The uniform prior on \(\frac{\alpha}{\alpha + \beta}\) implies that every average probability of drawing a black marble is equally likely prior to seeing any data
  • The exponential prior on \(\alpha + \beta\) implies that smaller values (weaker concentration) are more likely prior to seeing any data

\(\frac{\alpha}{\alpha + \beta} \sim \text{Unif}(0, 1)\)

\(\alpha + \beta \sim \text{Exp}(1)\)
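Read top-down, the four levels define a generative process. A minimal pure-Python sketch (function and variable names are my own, not from the source):

```python
import random

def simulate_bags(num_bags, draws_per_bag, seed=0):
    """Sample marble data from the hierarchical Beta-Binomial process."""
    rng = random.Random(seed)
    # Level 4: hyperpriors on the mean and precision of the Beta distribution
    mean = rng.uniform(0.0, 1.0)        # alpha / (alpha + beta) ~ Unif(0, 1)
    precision = rng.expovariate(1.0)    # alpha + beta ~ Exp(1)
    alpha = mean * precision
    beta = (1.0 - mean) * precision
    bags = []
    for _ in range(num_bags):
        # Level 3: bag-specific probability of drawing a black marble
        theta = rng.betavariate(alpha, beta)
        # Level 2: binomial draws from this bag
        y = sum(rng.random() < theta for _ in range(draws_per_bag))
        # Level 1: the observed data d_i = {y_i, n_i}
        bags.append((y, draws_per_bag))
    return bags
```

Because all bags share one (mean, precision) pair, simulated bags tend to look alike, which is exactly the coupling the hierarchy exploits at inference time.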

Posterior Inference of HBM

Applying Bayes Formula to HBM

\[ \begin{gathered} \overbrace{P(\theta, \alpha, \beta ~ | ~ y)}^{\text{Posterior}} \propto \underbrace{P(\alpha, \beta)}_{\text{Hyperprior}} \overbrace{P(\theta ~ | ~ \alpha, \beta)}^{\text{Conditional Prior}} \underbrace{P(y ~ | ~ \theta, \alpha, \beta)}_{\text{Likelihood}} \end{gathered} \]

Posterior inference about \(\theta_i\) proceeds by integrating out \(\alpha\) and \(\beta\):

\[ \begin{align*} P(\theta_i ~ | ~ d_1, \dots, d_N) = \iint P(\theta_i ~ | ~ \alpha, \beta, d_i) \, P(\alpha, \beta ~ | ~ d_1, \dots, d_N) \,d\alpha \,d\beta \end{align*} \]
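For intuition, the double integral can be approximated on a coarse grid over the reparameterized hyperparameters (mean and precision). A rough pure-Python sketch; the grid ranges, resolution, and function names are my own choices, not from the source:

```python
import math

def log_betabinom(y, n, a, b):
    """log P(y | n, a, b) with theta integrated out (Beta-Binomial pmf)."""
    return (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
            + math.lgamma(y + a) + math.lgamma(n - y + b) - math.lgamma(n + a + b)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def posterior_mean_theta(bags, i, grid=40):
    """Approximate E[theta_i | d_1, ..., d_N] on a (mean, precision) grid."""
    weighted_sum, total_weight = 0.0, 0.0
    y_i, n_i = bags[i]
    for mi in range(1, grid):
        mu = mi / grid                  # grid over alpha / (alpha + beta)
        for pi in range(1, grid + 1):
            phi = 10.0 * pi / grid      # grid over alpha + beta, truncated at 10
            a, b = mu * phi, (1.0 - mu) * phi
            # hyperposterior weight: Exp(1) prior on phi times marginal likelihood
            log_w = -phi + sum(log_betabinom(y, n, a, b) for y, n in bags)
            w = math.exp(log_w)
            # conditional posterior mean of theta_i given (a, b) and d_i
            weighted_sum += w * (a + y_i) / (a + b + n_i)
            total_weight += w
    return weighted_sum / total_weight
```

With a mostly-black history plus a single black draw from a new bag, the estimate for the new bag is pushed well above 0.5, matching the intuition from the marbles example.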

Implementation

The double integral has no closed-form solution, so it is approximated with Markov chain Monte Carlo (MCMC) methods, e.g., Hamiltonian Monte Carlo (HMC), using Stan:

// Beta-Binomial hierarchical model in Stan
data {
  int<lower=0> N;           // Number of bags
  array[N] int<lower=0> n;  // Number of marbles drawn from each bag
  array[N] int<lower=0> y;  // Number of black marbles in each bag
}

parameters {
  real<lower=0,upper=1> mu;               // Hyperparameter: mean of the Beta distribution              
  real<lower=0> phi;                      // Hyperparameter: precision of the Beta distribution               
  array[N] real<lower=0, upper=1> theta;  // Bag-specific proportion of black marbles
}

transformed parameters {
  // Reparameterization of the Beta distribution
  real<lower=0> alpha = mu * phi;
  real<lower=0> beta  = (1 - mu) * phi;
}

model {
  mu  ~ uniform(0, 1);        // Hyperprior for µ     
  phi ~ exponential(1);       // Hyperprior for ϕ   
  theta ~ beta(alpha, beta);  // Conditional prior for θ
  y ~ binomial(n, theta);     // Likelihood
}
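Assuming the program above is saved as `model.stan` (a file name chosen here for illustration), it could be fit with CmdStanPy. The data values below are invented to mirror the marbles example:

```python
# Data for a hypothetical marbles experiment: seven well-observed, mostly-black
# bags plus one new bag seen only once (values invented for illustration).
data = {
    "N": 8,
    "n": [10, 10, 10, 10, 10, 10, 10, 1],   # marbles drawn per bag
    "y": [10, 9, 10, 10, 9, 10, 10, 1],     # black marbles per bag
}

# Fitting sketch with CmdStanPy (requires CmdStan to be installed):
# from cmdstanpy import CmdStanModel
# model = CmdStanModel(stan_file="model.stan")
# fit = model.sample(data=data)
# print(fit.summary())
```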

Applying the Model to the Marbles Example

Interim Summary

How did the model solve the marble problem?

  • Abstraction of inference into multiple levels of learning
  • Forming an overhypothesis at the higher levels from lower-level observations
  • The model acquired an inductive bias: homogeneous history (bags are usually one color) vs. mixed history (bags are usually mixed)
  • The learned bias allows strong inferences about a new bag given only a single observation (one-shot learning)
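The one-shot inference can be made concrete with the Beta-Binomial conjugate update: given learned hyperparameters, observing \(y\) black marbles in \(n\) draws yields the posterior \(\text{Beta}(\alpha + y, \beta + n - y)\). A small sketch, with hyperparameter values invented to contrast the two histories:

```python
def posterior_mean(alpha, beta, y, n):
    """Posterior mean of theta under the conjugate update Beta(alpha + y, beta + n - y)."""
    return (alpha + y) / (alpha + beta + n)

# Homogeneous history: low precision (alpha + beta small), so one draw dominates.
one_shot_homogeneous = posterior_mean(0.2, 0.2, y=1, n=1)   # about 0.86
# Mixed history: prior mass concentrated near 0.5, so one draw barely moves it.
one_shot_mixed = posterior_mean(5.0, 5.0, y=1, n=1)         # about 0.55
```

The same single black marble yields a confident prediction under the homogeneous-history prior but a near-chance prediction under the mixed-history prior.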

Application of HBMs to
Category Learning

Motivating Example

A mother points to an unfamiliar object lying on the counter and tells her child that this is a pen.

Question

By which features do children generalize the word “pen” and recognize future instances?

  • In principle, the child could generalize the word to objects with the same material, same color, same texture, or simply objects lying on the counter
  • But empirically, children tend to generalize the new word to other objects that share the shape

Shape Bias

The expectation that members of a category tend to be similar in shape, which is learned by the age of 24 months (Smith et al., 2002).

Model Adaption

Marble World           Cognitive World
Bag                    Category (e.g., “Dax”)
Marble                 Object Exemplar
Color (Black/White)    Feature Value (Round/Square, Red/Blue)

The Structural Shift

Real objects aren’t just “Black or White.” They are multi-dimensional. We must expand the model from Binary (Beta-Binomial) to Multinomial (Dirichlet-Multinomial).
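A minimal sketch of that expansion, with each category holding a probability vector over feature values instead of a single \(\theta\). The function names are my own, and the Gamma-based Dirichlet construction is a standard one, not taken from the source:

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw a probability vector from Dirichlet(alphas) via normalized Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_category(alphas, num_exemplars, seed=0):
    """One category: feature-value probabilities, then exemplar feature values."""
    rng = random.Random(seed)
    theta = sample_dirichlet(alphas, rng)   # replaces the scalar theta_i per bag
    values = [rng.choices(range(len(theta)), weights=theta)[0]
              for _ in range(num_exemplars)]
    return theta, values
```

Just as a shared Beta prior lets bags inform each other, a shared Dirichlet prior over feature-value distributions lets categories inform each other about which feature dimensions vary within a category.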

Application to Noun Generalization Task

Glassen & Nitsch (2016); Griffiths et al. (2024); Kemp et al. (2007)

Table 1: Training Data

Category   1  1  2  2  3  3  4  4
Shape      1  1  2  2  3  3  4  4
Texture    1  2  3  4  5  6  7  8
Color      1  2  3  4  5  6  7  8
Size       1  2  1  2  1  2  1  2

  • Two exemplars per category (columns)
  • Different feature dimensions (shape, texture, color, size)
  • Pairs of objects belonging to the same category share the same shape!

Table 2: Testing Data

           “Dax”  Object 1  Object 2  Object 3
Category     5       ?         ?         ?
Shape        5       5         6         6
Texture      9      10         9        10
Color        9      10        10         9
Size         1       1         1         1

After training, children (and the model) encounter a new object labeled with a novel noun, “dax”.

Task: Which of the three candidate objects with unknown category labels is most likely to be a dax?

Data based on Smith et al. (2002)

Results of Noun Generalization Task

  • 19-month-olds who received the structured training chose the shape match
  • Untrained 19-month-olds chose randomly
  • The hierarchical Bayesian model shows the same preference pattern as trained children

Summary


References

Glassen, T., & Nitsch, V. (2016). Hierarchical Bayesian models of cognitive development. Biological Cybernetics, 110, 217–227. https://doi.org/10.1007/s00422-016-0686-6
Griffiths, T. L., Chater, N., & Tenenbaum, J. (2024). Bayesian Models of Cognition: Reverse Engineering the Mind. MIT Press. https://mitpress.mit.edu/9780262049412/bayesian-models-of-cognition/
Kemp, C., Perfors, A., & Tenenbaum, J. B. (2007). Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3), 307–321. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-7687.2007.00585.x
Smith, L. B., Jones, S. S., Landau, B., Gershkoff-Stowe, L., & Samuelson, L. (2002). Object name learning provides on-the-job training for attention. Psychological Science, 13(1), 13–19. https://doi.org/10.1111/1467-9280.00403